ABSTRACT

Evolutionary analyses of biological data are 
becoming a prerequisite in many fields of biology. 
At a time of high-throughput data analysis, phylo- 
genetics is often a necessary complementary tool 
for biologists to understand, compare and identify 
the functions of sequences. But available bioinfor- 
matics tools are frequently not easy for non- 
specialists to use. We developed PhyleasProg 
(http://phyleasprog.inra.fr), a user-friendly web 
server as a turnkey tool dedicated to evolutionary 
analyses. PhyleasProg can help biologists with 
little experience in evolutionary methodologies by 
analysing their data in a simple and robust way, 
using methods corresponding to well-accepted standards.
Via a very intuitive web interface, users only need to 
enter a list of Ensembl protein IDs and a list of 
species as inputs. After dynamic computations, 
users have access to phylogenetic trees, positive/ 
purifying selection data (on site and branch-site 
models), with a display of these results on the 
protein sequence and on a 3D structure model, 
and the synteny environment of related genes. This 
connection between different domains of phylogen- 
etics opens the way to new biological analyses for 
the discovery of the function and structure of 
proteins. 

INTRODUCTION 

Today, more and more eukaryotic genomes have been 
sequenced thanks to second-generation sequencing 
technologies thereby providing an extraordinary wealth 
of information for evolutionary analyses. Currently, the 
GOLD web site (1) lists more than 3000 eukaryotic 
genomes whose sequencing is complete or ongoing. 
Under these circumstances, bioinformatics tools could
help to understand the evolutionary histories of proteins
especially by connecting phylogenetic analysis and
positive selection calculations. These approaches consti- 
tute the core of many biological research areas, and as 
stated by Theodosius Dobzhansky 'Nothing in biology 
makes sense except in the light of evolution'. Indeed, 
present protein sequences are the result of a long, 
complex and extensive evolutionary process. Proteins 
have different levels of conservation. Active sites or 
protein-protein interaction domains are often well 
conserved, while highly variable regions may carry sites 
under positive selection. Such positively selected sites 
may be interpreted as being a consequence of molecular 
adaptation, which may confer an evolutionary advantage 
to the organism (2–4).

Accordingly, the association of (i) the establishment of
orthology and paralogy relationships; (ii) the functional
inference by reconstruction of the phylogenetic tree; and
(iii) the identification of sites/genes under positive
selection is an important step, not only in studies of
evolutionary biology, but also in functional studies.
Projecting the results of positive selection onto the 3D
structure of proteins makes this association a powerful
and very useful tool for biologists. The combined data
could help biologists plan site-directed mutagenesis
experiments. However, obtaining a phylogenetic tree
requires successive computations including identification
of homologous sequences, multiple alignment, phylogenetic
reconstructions and graphic representation of the
inferred tree. Obtaining positive selection data requires
the use of mathematical methods, such as
PAML (5), which are designed for specialists.

Several web sites offer phylogenetic tree reconstruction. 
Some are turnkey systems such as PhyloBuilder (6) and 
POWER (7). Some offer a single tool, while others bring 
together many of the most popular programs for phylo- 
genetic reconstruction such as Mobyle (8). The web server 
Phylogeny.fr (9) is designed for non-specialists and has 
up-to-date programs that are often designed for experts. In 
parallel, two phylogenetic tree databases, PhylomeDB (10) 
and TreeFam (11), offer a large number of pre-computed 
trees based on all genes of all genomes. A number of
web sites are also available for analysing evolution- 
ary forces. The web server Selecton (12) offers a 
user-friendly tool to compute positive selection and 
displays results on a 3D structure of proteins. However, 
it only allows calculation of one set of orthologues. The 
DataMonkey server (13) enables detection of signatures 
of positive and negative selection from coding sequence 
alignments using a wide range of statistical models. 
The Selectome (14) database provides the results of a 
branch-site-specific likelihood test for positive selection 
based on whole gene families from the TreeFam 
database. Phylemom (15) enables experts to build a 
complete pipeline dedicated to phylogenetics and 
evolution. 

Many tools are already available to address phylogenetic
and evolutionary questions. However, they are
complex to use and do not allow all the necessary
computations to be carried out on a single server. Phylogenetic
tree reconstruction, positive selection detection and
protein 3D structure modelling require (i) installation/
use of numerous tools; (ii) knowledge of up-to-date
tools; and (iii) substantial computational resources. In
particular, when biologists analyse several proteins of 
interest, they want to repeat bioinformatics methods on 
their data in the same conditions and they want to obtain 
results in a reasonable amount of time. This is why we 
built PhyleasProg web server in such a way that it could 
be used by the largest possible number of biologists. Our 
aim was to combine usefulness and usability. Such a server 
is a helpful guide for biologists with little experience in 
evolutionary methodologies as it can analyse their data 
in a simple and robust way, using methods corresponding 
to well-accepted standards. 

Via a very simple interface, users enter one or a list of 
Ensembl protein IDs (16) and choose a set of species 
about which they wish to obtain evolutionary information 
among the sequenced vertebrates in Ensembl. Once 
submitted, each ID is treated independently and the com- 
putations are performed on both orthologues and para- 
logues of the related genes. As output, PhyleasProg 
provides (i) phylogenetic trees; (ii) positive/purifying selec- 
tion data (on site and branch-site models) with visualiza- 
tion of these outcomes on the protein sequence and 
whenever possible, on a 3D structure; and (iii) the
genomic environment of related genes. To our knowledge, 
no other web server performs all these tasks on several
input sequences simultaneously. In addition, PhyleasProg
computes the degree of purifying selection and positive 
Darwinian selection for each site in the protein sequence 
and displays these data on the modelled molecular struc- 
ture of the protein. To guide users through these different 
evolutionary methods, which are not always very easy for 
non-experts, the pipeline only returns results if they are 
statistically significant. 

This unique connection between phylogenetic trees, 
synteny studies, positive/purifying selection data and 3D 
structures opens the way to new biological analyses to 
improve our understanding of function and structure of 
proteins. 



OVERVIEW 

The PhyleasProg pipeline is a combination of Perl 
modules and external software (Figure 1). As input data, 
it requires one or a list of Ensembl protein IDs and a list of 
species selected among completely or partially sequenced 
vertebrates in Ensembl (16). Once the process is complete, 
users can obtain evolutionary results on each ID 
submitted, treated independently but simultaneously, on 
orthologues and paralogues of the related genes. 

We intentionally chose to not embed an exhaustive num- 
ber of similar methodologies in our platform. We chose 
rapid, up-to-date, accurate and proven tools. Multiple 
sequence alignments are performed by MUSCLE (17) 
and are refined by GBLOCKS (18), itself improved by a 
home-made Perl program. TREEBEST (http://treesoft 
.sourceforge.net/treebest.shtml) reconstructs phylogenetic 
trees. CODEML, a PAML program (5), performs positive 
selection computation. MODELLER (19) builds 
homology models of the 3D structure of proteins. 

Data visualization was an important goal for the devel- 
opment of this platform. JALVIEW (20) is used to display 
multiple sequence alignments, ARCHAEOPTERYX (21) 
for interactive manipulation of phylogenetic trees and 
JMOL (22) to display the 3D structure of proteins. We 
were careful to present processes and results very simply to 
enable biologists to navigate through a user-friendly 
environment. To guide users, the pipeline only returns 
significant results. Moreover, all input and output data 
can be downloaded as flat files. 

A cluster computer manages the execution of the whole 
pipeline. This choice allows a very reasonable execution 
speed and authorizes PhyleasProg to work on several 
proteins simultaneously. The user interface was developed
in Perl CGI and optimized for the Firefox browser.

PHYLEASPROG PIPELINE 

Data acquisition 

Input. For a very simple use of PhyleasProg, only 
Ensembl IDs of the proteins to be studied and a list of 
the species with which they should be compared are re- 
quired as inputs. Protein IDs can be separated by a 
comma, a space or a new line character. Ensembl 
protein IDs are unique: they start with 'ENS' and the
last letter before the digits must be a 'P' (e.g. ENSMUSP00000099398). To
choose species for which they want evolutionary results, 
users simply tick the name of the species in the lists of 
completely and partially sequenced genomes. The Job 
summary page lists the IDs submitted and
the selected species, and displays the status of the process for
each ID. 
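
The input rules above can be sketched as a small parser. This is only an illustrative sketch, not the server's actual Perl code: the function name `parse_protein_ids` and the regular expression are assumptions that encode the stated rules (IDs start with 'ENS', the last letter before the digits is 'P', and IDs are separated by commas, spaces or new lines).

```python
import re

# Assumed pattern: 'ENS', an optional species code, 'P', then digits.
ENSEMBL_PROTEIN_ID = re.compile(r"^ENS[A-Z]*P\d+$")


def parse_protein_ids(raw: str) -> tuple[list[str], list[str]]:
    """Split on commas, spaces or new lines; return (valid, rejected) IDs."""
    tokens = [t for t in re.split(r"[,\s]+", raw.strip()) if t]
    valid = [t for t in tokens if ENSEMBL_PROTEIN_ID.match(t)]
    rejected = [t for t in tokens if not ENSEMBL_PROTEIN_ID.match(t)]
    return valid, rejected
```

Returning the rejected tokens separately would let an interface report malformed entries (for example a gene ID ending in 'G') instead of silently dropping them.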

Interrogation of Ensembl database. We chose to work with 
Ensembl protein IDs because Ensembl provides 
high-quality genome annotation across vertebrate species 
and allows computer scientists to retrieve a lot of data 
very quickly, thanks to a Perl application programming 
interface (API) (23). 

Figure 1. The workflow of PhyleasProg web server.

Using this API, for each protein ID submitted, we
retrieved protein and related transcript sequences,
related gene ID, orthologous and paralogous protein
IDs, orthologue and paralogue protein sequences and
related transcript sequences (Figure 1A and A'). Among
the numerous orthologues identified in Ensembl, we
chose to keep either the one-to-one orthologues or
the related gene with the shortest evolutionary distance
among the one-to-many or the many-to-many
orthologues (24).
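
The orthologue filtering rule can be illustrated as follows. This is a hedged sketch, not the pipeline's code: the `Orthologue` record, the relationship labels and the idea of keeping one orthologue per species are assumptions made for illustration, and the evolutionary distance is assumed to be already available for each candidate.

```python
from dataclasses import dataclass


@dataclass
class Orthologue:
    species: str
    protein_id: str
    kind: str        # assumed labels: "one2one", "one2many", "many2many"
    distance: float  # evolutionary distance to the query (assumed available)


def select_orthologues(candidates: list[Orthologue]) -> dict[str, Orthologue]:
    """Keep one orthologue per species (assumption): a one-to-one
    orthologue if present, otherwise the candidate with the shortest
    evolutionary distance."""
    by_species: dict[str, list[Orthologue]] = {}
    for candidate in candidates:
        by_species.setdefault(candidate.species, []).append(candidate)

    chosen: dict[str, Orthologue] = {}
    for species, group in by_species.items():
        one_to_one = [o for o in group if o.kind == "one2one"]
        pool = one_to_one or group
        chosen[species] = min(pool, key=lambda o: o.distance)
    return chosen
```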



W482 Nucleic Acids Research, 2011, Vol. 39, Web Server issue



Reconstruction of phylogenetic trees 

Multiple sequence alignment and refinement. For each 
protein ID submitted, PhyleasProg reconstructs phylogen- 
etic trees of both orthologues and paralogues. And for 
each orthologue related to one of the protein IDs 
submitted, a phylogenetic tree of paralogues is also 
reconstructed. 

As shown in Figure 1B, multiple sequence alignment
(MSA) of proteins is generated by MUSCLE. This align- 
ment is then converted into multiple codon alignment by 
PAL2NAL (25). As our pipeline offers a turnkey process, 
we had to pay particular attention to the quality of MSA 
because this is essential for the quality of the related 
phylogenetic tree. Thus, GBLOCKS is used to edit 
MSA. This software removes all sites containing at least 
one gap and sites that are too divergent because these 
positions might not be homologous or might be saturated 
by multiple substitutions. First of all, GBLOCKS is
run with strict parameters (type = codons; maximum
number of contiguous non-conserved positions = 8; 
minimum length of a block = 10; no gaps allowed). 
After this first step, the generated MSA can be very 
short, which would seriously damage the rest of the
computations in the PhyleasProg pipeline. Consequently,
refinement steps are performed recursively: if after
GBLOCKS, the MSA length is <30% of the median 
length of sequences in the raw MSA, the sequence that 
induces most of the gaps is removed from the dataset, 
and a new MSA is computed. If the length of the clean 
MSA is between 30 and 50%, a new editing with 
GBLOCKS is performed on the raw MSA with relaxed 
parameters (type = codons; maximum number of con- 
tiguous non-conserved positions = 10; minimum length 
of a block = 5; no gaps allowed). If after this last step, 
the length of the MSA is still too short, computation 
is aborted. Thus, it is important to estimate the quality 
of the MSA (downloadable through the flat files menu)
before analysing the other results of the pipeline (Figure 2).
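
The refinement thresholds described above can be summarized in a small decision function. This is a minimal sketch under stated assumptions: the function and action names are hypothetical, the lengths are assumed to be expressed in the same unit, and because the text does not say what "too short" means after the relaxed run, `abort_ratio` is only a placeholder value.

```python
from statistics import median


def refinement_action(clean_length: int, raw_lengths: list[int],
                      relaxed_done: bool = False,
                      abort_ratio: float = 0.30) -> str:
    """Decide the next refinement step from the length of the cleaned MSA
    relative to the median sequence length in the raw MSA."""
    ratio = clean_length / median(raw_lengths)
    if relaxed_done:
        # Assumption: the abort cut-off is not given in the text.
        return "abort" if ratio < abort_ratio else "accept"
    if ratio < 0.30:
        return "remove_sequence_and_realign"
    if ratio <= 0.50:
        return "rerun_gblocks_relaxed"
    return "accept"
```

For example, with raw sequences of 300, 320 and 280 positions (median 300), a cleaned alignment of 120 positions gives a ratio of 0.4 and triggers the relaxed GBLOCKS run.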

Phylogenetic reconstruction. The clean MSA from the 
previous step is used to reconstruct the phylogenetic tree 
by TreeBeST (Figure 1C). TreeBeST integrates multiple 
tree topologies, in particular both DNA- and protein-level 
models, and combines them with a species-tree-aware
penalization of topologies that are inconsistent with known
species relationships. TreeBeST is run with the option 
best. This enables the combination of (i) a maximum like- 
lihood (ML) tree built using PhyML (26) based on the 
protein alignment with the Whelan And Goldman 
model; (ii) a ML tree built using PhyML based on the 
codon alignment with the Hasegawa-Kishino-Yano 
(HKY) model; (iii) a neighbour-joining (NJ) tree using 
p-distance based on the codon alignment; (iv) a NJ tree
using dN distance (rate of non-synonymous substitutions)
based on the codon alignment; and (v) a NJ tree using dS
distance (rate of synonymous substitutions) based on the 
codon alignment. As TreeBeST runs with a species tree, 
the final phylogenetic tree is rooted by minimizing gene 
duplications and then losses, the best rooting strategy for 
this type of input. 



Visualization. Archaeopteryx, the successor of ATV (27), 
is a Java application used as applet for the display and 
manipulation of annotated phylogenetic trees. 

Positive/purifying selection calculations 

Overview. PhyleasProg gives positive and purifying selection
data using ML calculations which underlie the stochastic
process of evolution. CODEML, from the package
PAML (Figure 1D) (5), evaluates the ratio of
non-synonymous/synonymous substitution rates (dN/dS),
denoted ω, which is a measure of selective pressure.
Values of ω < 1, ω = 1 and ω > 1 are indicators of purifying
selection, neutral evolution and positive selection,
respectively. Two distinct categories of codon substitution
models are used: site models (M1a versus M2a, M7
versus M8 and M8a versus M8) and branch-site models.
For the two types of analyses, two models are compared:
one model which allows positive selection and one model
which does not allow positive selection. For each model,
the lnL (log likelihood) value is retrieved (lnL1 for the
model allowing positive selection, lnL0 for the other)
and a LRT (likelihood ratio test) is calculated
[LRT = 2 × (lnL1 − lnL0)] to assess the significance of
the results. The LRT value follows a χ² law, which
allows the P-value of the LRT to be obtained. If the
LRT is significant for the comparison, PhyleasProg lists
sites under positive selection detected by Bayes empirical
Bayes (BEB) with posterior probabilities >95% and sites
under purifying selection.
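
The likelihood ratio test can be illustrated with a short worked example. This is a sketch rather than the pipeline's implementation: the function names are hypothetical, the log-likelihood values in the usage note are invented, and the degrees of freedom are an assumption, since the text does not state them (2 is a common choice for M1a versus M2a and M7 versus M8, 1 for M8a versus M8). Closed forms of the chi-square survival function are used for 1 and 2 degrees of freedom so that only the standard library is needed.

```python
import math


def lrt_statistic(lnl_alt: float, lnl_null: float) -> float:
    """LRT = 2 x (lnL1 - lnL0), with lnL1 from the model allowing
    positive selection and lnL0 from the model that does not."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_p_value(lrt: float, df: int) -> float:
    """Upper-tail chi-square probability for df = 1 or 2 (closed forms)."""
    if lrt <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(lrt / 2.0))
    if df == 2:
        return math.exp(-lrt / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")
```

With illustrative values lnL1 = -1000 and lnL0 = -1005, the statistic is 10 and, for 2 degrees of freedom, the P-value is exp(-5), about 0.0067.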

As shown in Figure 2, selection pressure data appear in 
two separate menus. One is dedicated to results of site 
models and the other one to results of branch-site 
models. In the second case, these models allow the ω
ratio to vary both among sites in the protein and across 
branches on the tree and aim to detect positive selection 
affecting a few sites along particular lineages (foreground 
branches). In the pipeline, all branches of the tree are 
tested as foreground branches for positive selection. Two 
models are used, one called alternative and one called null. 
In the alternative model, three classes of sites are admitted 
for the foreground branch, ω0: dN/dS < 1, ω1: dN/dS = 1
and ω2: dN/dS > 1. In the null model, ω2 is fixed to 1.
Significant results with branch-site models are accessible 
on a clickable tree. Branches under positive selection are 
represented by a purple star and are highlighted in green. 
Raw result files (rst) of CODEML are also available. 

Visualization. Results of selection pressure calculation 
with site and branch-site models share the same
presentation (Figure 2). Data are visualized on 1D and 3D
structures on the same results page. A dropdown menu
embedded in the positive selection results web page 
enables users to visualize data on each protein in the 
orthologue or paralogue dataset. For the two types of 
representations, a discrete colour scale is used to
distinguish the different values of ω for each site. The scale from
green to yellow represents purifying selection, i.e. ω < 0.3,
while red and orange represent positive selection with
posterior probabilities >99% or >95%, respectively. White
means that no information is available for this site
because no calculation was performed by CODEML due
to at least one gap in the MSA at this position. Grey 
means results are not significant enough to infer either 
purifying or positive selection. To locate different amino 
acids in different organisms, the MSA used for PAML 
computation is displayed using the JalView applet. 
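
The colour code can be expressed as a simple mapping from per-site values to a category. This is an illustrative sketch only: the function name and arguments are hypothetical, the dN/dS ratio of a site is passed as `omega`, and the green-to-yellow scale is collapsed into a single label because the individual bins are not given in the text.

```python
from typing import Optional


def site_colour(omega: Optional[float],
                beb_posterior: Optional[float] = None) -> str:
    """Map one alignment site to a display colour category."""
    if omega is None:
        # Gap at this position: CODEML produced no estimate.
        return "white"
    if beb_posterior is not None and beb_posterior > 0.99:
        return "red"
    if beb_posterior is not None and beb_posterior > 0.95:
        return "orange"
    if omega < 0.3:
        # Purifying selection; the real scale has several shades.
        return "green-yellow"
    # Not significant enough to infer purifying or positive selection.
    return "grey"
```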

These data can greatly help biologists to plan 
site-directed mutagenesis experiments to target essential 
functional residues. This was the main reason to have 
PhyleasProg display results on a 3D structure, if one can 
be modelled (Figure 1E). To model the 3D structure, a
BLAST (28) search is performed to find a similar structure 
in the PDB database (29) in order to use it as a template to 
calculate a model with Modeller. 3D structure is some- 
times difficult to predict, mostly when the template is 
too distant from the sequence to be modelled. To avoid 
models of insufficient quality, a model is built only if: 
(i) the alignment between the sequence to be modelled
and the PDB template covers at least 80%
of the PDB sequence and at least 50% of the query sequence
and (ii) the percentage of identity between the two
sequences is at least 50%. If the query sequence is shorter
than the template, amino acids in the C- or N-terminal are
removed. In order to enable users to locate differences
between a raw query sequence and the model, the align- 
ment between the PDB sequence and raw query sequence 
is displayed using JalView. Hence, when a homology 
model can be built, evolutionary results are directly 
visualized on the modelled structure, while if homology 
modelling is not possible, results are only presented on 
the 1D sequence.
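
The template acceptance criteria can be written as a single check. This is a minimal sketch, assuming that coverage is computed as the aligned length divided by the length of each sequence; the function and parameter names are hypothetical and the way the server actually measures coverage is not described in the text.

```python
def template_is_acceptable(aligned_length: int, template_length: int,
                           query_length: int,
                           identity_percent: float) -> bool:
    """Return True if a PDB template passes the three stated thresholds."""
    template_coverage = aligned_length / template_length
    query_coverage = aligned_length / query_length
    return (template_coverage >= 0.80
            and query_coverage >= 0.50
            and identity_percent >= 50.0)
```

For instance, an alignment of 200 residues between a 300-residue query and a 220-residue template with 62% identity covers about 91% of the template and 67% of the query, so a model would be built.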

Synteny exploration 

In order to achieve complete evolutionary analysis of the 
protein submitted, PhyleasProg offers the possibility to 
explore the genetic environment of related genes. Indeed, 
in the results menu (Figure 2) the user has a link to 
Genomicus (30). This database is a synteny browser that 
can represent and compare numerous genomes in a broad 
phylogenetic view. In addition, Genomicus includes the 
reconstructed organization of ancestral genes, thus
greatly facilitating interpretation of the data. We chose 
not to develop our own genome browser because this 
web tool is really accurate, complete, up-to-date, 
user-oriented and also based on Ensembl data. 

CONCLUSION AND FUTURE DEVELOPMENTS

With PhyleasProg, we offer biologists a tool specially 
developed for non-specialists of phylogenetics, which is 
user-oriented, fast, complete, up-to-date, ready-to-use 
and accessible via a web interface, and allows the user to 
submit several jobs at the same time. All computations are 
dynamically produced and displayed as soon as the results 
are available, so the user can begin to analyse results 
without waiting for the whole process to end. 

Thanks to the modular architecture of our pipeline, it is 
relatively easy to update and to incorporate new tools. In 
the short term, our main plan is to extend the range of 
possible inputs. With the present system, only proteins 
from organisms available in Ensembl can be treated in 
PhyleasProg. A FASTA sequence as input, for example, 
could be useful. We also want to let users upload their 
own PDB files. In the very near future, we will offer a 
3D structure model based on a multiple alignment 
including several proteins from the PDB database, which 
would improve the quality of the models. Finally, to 
provide more accurate selection pressure data, we are
already thinking about a way to minimize the guanine- 
cytosine bias in positive selection results. 